(54) Method of using primary and secondary processors

(57) The invention relates to the compilation of
source code to a primary and a secondary processor. It
relates to reconfigurable secondary processors, and is
especially relevant to secondary processors which can
be reconfigured to some degree during execution of
code. Selective extraction of dataflows from the source
code is followed by transformation of the extracted
dataflows into trees. The trees are then matched against
each other to determine minimum edit cost relationships
for transformation of one tree into another, where
these minimum edit cost relationships are determined
by the architecture of the secondary processor. A group
or a plurality of groups of dataflows is determined on the
basis of said minimum edit cost relationships and for
each group a generic dataflow capable of supporting
each dataflow in that group is created. The generic
dataflow or dataflows is then used to determine the
hardware configuration of the secondary processor;
and calls to the secondary processor for said group or
plurality of groups of dataflows are substituted into the
source code. The resultant source code is compiled to
the primary processor.

The resulting efficient configuration thus reduces
either the expense of reconfiguration (in a field
programmable array), or the silicon area (in an application
specific integrated circuit).
Description



[0001] The present invention relates to the compilation and execution of source code for a processor architecture
consisting of a primary processor and one (or more) secondary processors. The invention is particularly, though not
exclusively, relevant to architectures employing a reconfigurable secondary processor.

[0002] A primary processor - such as a Pentium processor in a conventional PC (Pentium is a trade mark of Intel
Corporation) - has evolved to be versatile, in that it is adapted to handle a wide range of computational tasks without
being optimised for any of them. Such a processor is thus not optimised to handle efficiently computationally intensive
operations, such as parallel sub-word tasks. Such tasks can cause significant bottlenecks in the execution of code.
[0003] An approach taken to solve this [...] particular applications. These are known as ASICs, or
application-specific integrated circuits. Tasks for which such an ASIC is adapted are generally performed very well:
however, the ASIC will generally perform poorly, if at all, on tasks for which it is not configured. Clearly, a specific
IC can be built for a particular application, but this is not a desirable solution for applications that are not central
to the operation of a computer, or are not yet determined at the time of building the computer. It is thus particularly
advantageous for an ASIC to be reconfigurable, so that it can be optimised for different applications as required. The
commonest form of architecture for such devices is the field programmable gate array (FPGA), a fine-grained processor
structure which can be configured to have a structure which is suited to any given application. Such structures can be
used as independent processors [...] use as coprocessors.
[0004] Such configurable coprocessors have the potential to improve the [...] For particular tasks, code run
inefficiently by the primary processor can be extracted and run more efficiently in an adapted coprocessor which has been
optimised for that application. With continued development of such "application-specific" secondary processors, the
possibility of improving performance by extracting difficult code to a custom coprocessor becomes more attractive. A
particularly important example in general computing is the extraction of loop bodies in image handling.

[0005] To obtain the desired efficiency gains, it is necessary to determine as effectively as possible how code is to be
divided between primary and secondary processors, and to configure the secondary processor for optimal execution of
its assigned part of the code. One approach is to mark the code appropriately on its creation for mapping to coprocessor
structures. In "A C++ compiler for FPGA custom execution units synthesis", Christian Iseli and Eduardo Sanchez,
IEEE Symposium on FPGAs for Custom Computing Machines, Napa, California, April 1995, an approach is employed
which involves mapping of C++ to FPGAs in VLIW (Very-Long Instruction Word) structures after appropriate tagging
of the initial code by the programmer. This approach relies on the initial program[...] extract initially.

[0006] An alternative approach is to assess the initial code to determine which the most appropriate elements to direct
to the secondary processor will be. "Two-Level Hardware/Software Partitioning Using CoDe-X", Reiner W. Hartenstein,
Jürgen Becker and Rainer Kress, in IEEE Symposium on Engineering of Computer Based Systems (ECBS),
Friedrichshafen, Germany, March 1996, discusses a codesign tool which incorporates a profiler to assess which parts of
an initial code are suitable for allocation to a coprocessor and which should be reserved for the primary processor. This
is followed by an iterative procedure allowing for compilation of a subset of C code to a reconfigurable coprocessor
architecture so that the extracted code can be mapped to the coprocessor. This approach does expand to [...] secondary
processors, but does not fully realize the potential of reconfigurable logic.

[0007] Comparable approaches have been proposed in the BRASS research project at the University of Berkeley. An
approach, "[...] FPGA Mapping and Placement", Tim Callahan & John Wawrzynek, a poster
presented at FCCM'97, Symposium on Field-Programmable Custom Computing Machines, April 16-18 1997, Napa
Valley, California (currently available on the World Wide Web at httpy/wwwcs'bert^
l^edu^jectsrorass^cjccm^er.truimbps), uses template structures representative of an FPGA architecture to
assist in the mapping of source code on to FPGA structures. Source code samples are rendered as directed acyclic
graphs (DAGs), which are then reduced to trees. These and other basic graph concepts are set out, for example, in
"High Performance Compilers for Parallel Computing", Michael Wolfe, pages 49 to 56, Addison-Wesley, Redwood City,
1996, but a brief definition of a DAG and a tree follows here.

[0008] A graph consists of a set of nodes, and a set of edges: each edge is defined by a pair of nodes (and can be
considered graphically as a line joining those nodes). A graph can be either directed or undirected: in a directed graph
each edge has a direction. If it is possible to trace a path through the graph from one node back to itself, then the
graph is cyclic. If not, then the graph is acyclic. A DAG is a graph that is both directed and acyclic: it is thus a
hierarchical structure. A tree is a specific kind of DAG. A tree has a single node termed "root", and there is a unique
path from the root to each other node. If there is an edge X->Y in a tree, then node X is termed the parent of Y and Y is
termed the child of X. In a tree, a "parent node" has one or more "child nodes", but a child node can have only one
parent, whereas in a general DAG, a child can have more than one parent. Nodes of a tree with no children are termed leaf
nodes.

[0009] In the work of Tim Callahan & John Wawrzynek, [...] "tree covering" program called Iburg. Iburg is a generally
available software [...] "A Retargetable C Compiler: Design and Implementation", Christopher W. Fraser and David R.
Hanson, Benjamin/Cummings Publishing Co., Inc., Redwood City, 1995, especially at pp 373-407. Iburg takes as input the
source code trees and partitions this input into chunks that correspond to instructions on the target processor. This
partition is termed a tree cover. This approach is essentially determined by the user-defined patterns allowable for a
chunk, and is relatively complex: it involves a bottom-up matching of a tree with patterns, recording all possible
matches, followed by a top-down reduction pass to determine which match of patterns provides the lowest cost. Again, this
approach requires a significant initial constraint in the form of the predefined set of allowable patterns, and does not
fully realize the possibilities of a reconfigurable architecture.
[0010] There is thus a need to develop techniques [...] systems involving a primary and secondary processor, by which
an optimal choice can be made for allocation of code to a secondary processor, which can then be configured as
efficiently as possible to run the extracted code, with a view to maximising the performance efficiency of [...]

[0011] Accordingly, the invention provides a method of compiling source code to a primary and a secondary processor,
comprising: selective extraction of dataflows from the source code; transformation of the extracted dataflows into
trees; matching of the trees against each other to determine minimum edit cost relationships for transformation of one
tree into another; determining a group or a plurality of groups of dataflows on the basis of said minimum edit cost
relationships and creating for each group a generic dataflow capable of supporting each dataflow in that group; using the
generic dataflow or dataflows to determine the hardware configuration of the secondary processor; substituting into
the source code calls to the secondary processor for said group or plurality of groups of dataflows; and compiling the
resultant source code to the primary processor.

[0012] This approach allows for optimal selection of source code dataflows for allocation to the secondary processor
without prejudgement of suitability (by, for example, mapping onto predetermined templates) but while still taking full
account of the demands and requirements of the secondary processor architecture. Advantageously, said minimum edit
cost relationships are determined according to the architecture of the secondary processor, and represent a hardware
cost of a corresponding reconfiguration of the secondary processor. The method is particularly effective if the minimum
edit cost relationships are embodied in a taxonomy of minimum edit distances for [...]
[0013] The method finds its most useful application where the hardware configuration of the secondary processor
allows for reconfiguration of the secondary processor during execution of the source code, as this allows for
reconfiguration of the secondary processor to be required during execution of the source code to support each dataflow in
the group supported by a generic dataflow. The secondary processor may thus be an application specific instruction
processor, and the processor hardware may be a field programmable gate array or a field programmable arithmetic array
(such as that shown in the CHESS architecture discussed in Appendix A).

[0014] Advantageously, the generic dataflow of a group is calculated by an approximate mapping of dataflows in the 
group on to each other, followed by a merge operation. 
[0015] An advantageous approach [...] directed acyclical graphs and reduce them to trees by removal of any links in
the directed acyclical graphs not present in a critical path between a leaf node and the root of a directed acyclical
graph, wherein a critical path [...] which passes through the largest number of intermediate nodes. Alternative criteria
to the critical path can be adopted if more appropriate to the secondary processor hardware (for example, if a different
criterion can be found which is more sensitive to the timing of operations in the secondary processor).

[0016] An advantageous further step can be taken after the creation of a generic dataflow, in which the generic
dataflow is compared with further dataflows extracted from the source code, wherein those of said further dataflows which
match sufficiently closely the generic dataflow are added to the generic dataflow. This enables more or all of the code
present in the source code which is suitable for allocation to the secondary processor to be so allocated.
[0017] In the approaches indicated above, the removed links are stored after the directed acyclical graphs are
reduced to trees and are reinserted into the generic dataflow after the merging of the trees of the group into the generic
dataflow.

[0018] Specific embodiments of the invention are described below, by way of example, with reference to the accom- 
panying drawings, of which: 

Figure 1 shows a general purpose computer architecture to which embodiments of the invention can suitably be
applied;

Figure 2 shows schematically a method of compiling source code to a primary and a secondary processor
according to an embodiment of the invention;

Figure 3 illustrates a step of conversion of a DAG to a tree employed in a method step according to one
embodiment of the invention;

Figure 4a illustrates the step of insertion and deletion of nodes and Figure 4b illustrates the step of substitution of
nodes in a tree matching process employed in a method step according to an embodiment of the invention;

Figure 5 shows an edit distance taxonomy provided in an example according to an embodiment of the invention;
Figure 6 illustrates a generic dataflow provided in an example according to one embodiment of the invention;

Figure 7 shows a logical interface for allocation of secondary processor resources for a generic dataflow according
to an embodiment of the invention;

Figure 8 shows the application of DAGs to dataflows including multiplexers to handle conditional statements; and
Figures 9a to 9d show an illustration of the merging of candidate dataflows to form [...]
according to an embodiment of the invention.

[0019] The present invention is adapted for compilation of source code to an architecture comprising a primary and
a secondary processor. An example of such an architecture is shown in Figure 1. The primary processor 1 is a
conventional general-purpose processor, such as a Pentium II processor of a personal computer. Receiving calls from the
primary processor 1 and returning responses to it are secondary processors 2 and (optionally) 4. Each secondary
processor 2, 4 is adapted to increase the computational power and efficiency of the architecture by handling parts of the
source code not well handled by the primary processor 1. Secondary processor 4, optionally present here, is a
dedicated coprocessor adapted to handle a specific function (such as JPEG, DSP or the like) - the structure of this
coprocessor 4 will be determined by a manufacturer to handle a specific frequently used function. Such coprocessors 4 are
not the specific subject of the present application. By contrast, the secondary processor 2 is not already optimised for
a specific function, but is instead configurable to enable improved handling of parts of the source code not well handled
by the primary processor. The secondary processor 2 is advantageously an application specific structure: it can be a
conventional FPGA, such as the Xilinx 4013 or any other member of the Xilinx 4000 series. An alternative class of
reconfigurable device, referred to as a field programmable arithmetic array, is described in Appendix A hereto. Such a
secondary processor can be configured for high computational efficiency in handling desired parts of the source code
for an application to be executed by the architecture.

[0020] Also employed in the computer architecture are memory 3, accessed by the primary processor 1 and, for
appropriate types of secondary processor 2, by the secondary processor 2, and input/output channel 5. Input/output
channel 5 here represents all further channels and hardware necessary to enable the user to interact with the
processors (for example, by programming) and to allow the processors to interact with all other parts of the computer
device 6.
[0021] The present invention is particularly relevant to the optimised partitioning of source code between primary
processor 1 and secondary processor 2, which allows for optimal configuration of secondary processor 2 to optimise
the handling of the application embodied in the source code by the architecture. A significant contribution is made by
the invention in the selection and extraction of code for use in the secondary processor.

[0022] The approach taken, according to an embodiment of the invention, is set out in Figure 2. The initial input to the
process is a body of source code. In principle, this can be in any language: the example described was carried out on
C code, but the person skilled in the art will readily understand how the techniques described could be adopted with
other languages. For example, the source code could be Java byte code: if Java byte code could be so handled, the
architecture of Figure 1 could be particularly well adapted to directly receiving and executing source code received from
the internet.

[0023] As can be seen from Figure 2, the first step in the process is the identification of appropriate candidate code
to be executed by the secondary processor 2. Typically, this is done by performing dataflow analysis on the source code
and building appropriate representations of the dataflows presented by selected lines of code (in most processes this
is normally preceded by a manual profiling of the code). This is a standard technique in compiling generally, and
application to secondary processors is discussed in, for example, Athanas et al, "An Adaptive Hardware Machine
Architecture and Compiler for Dynamic Processor Reconfiguration", IEEE International Conference on Computer Design, 1991,
pages 397-400.

[0024] The approach taken here is to build directed acyclical graphs (DAGs) which represent the dataflows of selected
code. An advantageous way to do this is by using a compiler infrastructure appropriately configured for the extraction
of dataflows: an appropriate compiler infrastructure is SUIF, developed by the University of Stanford and documented
at World Wide Web site http://suif.stanford.edu/ and elsewhere. SUIF is devised for compiler research
for high-performance systems, specifically including systems comprising more than one processor. A standard SUIF
utility can be used to convert C code to SUIF. It is then a simple process for one skilled in the art to use SUIF tools to
[...] performing a dataflow analysis over sections of SUIF and then recording the results of the analysis.

[0025] The extraction of DAGs from source code is a conventional step. The next step in the process, as can be seen
from Figure 2, is the conversion of these DAGs into trees. This step is a significant factor in making the optimal choice
of code for execution by the secondary processor 2. DAGs are complex structures, and difficult to analyse in an effective
manner. Reduction of DAGs to trees allows the aspects of the dataflows most important in determining their mapping
to hardware to be retained, while simplifying the structure sufficiently to allow analytical approaches to be made
significantly more effective.

[0026] Discussion of the reduction of DAGs to trees is made in "High Performance Compilers for Parallel Computing"
(as cited above), especially at pages 56 to 60. Different terminology is used here from that used in the cited reference,
but equivalent and comparable terms are indicated below. The type of trees constructed here are directly comparable
to the "spanning trees" referred to in the cited reference.

[0027] The preferred approach followed in the reduction of DAGs to trees is the removal of links not in the critical path
between leaf nodes and the root: this is illustrated in Figure 3. The critical path between nodes A and B is in a first
embodiment of this reduction process defined as the one that touches the maximum number of nodes. As a DAG is, by
definition, acyclic, distinct paths can be defined to meet this criterion. It is possible for there to be different paths
between nodes that have the same maximum number of nodes, but these paths are likely [...]
purpose of tree construction. While making an arbitrary selection between these paths is a valid approach, a key issue
in mapping the source code successfully is scheduling, which depends on timing information: accordingly, where it is
necessary to make a choice between alternative "critical paths" it is desirable to choose the one that would take the
longest time (in terms of time taken to execute each of the operations represented by the nodes in the path). As is
discussed further below, alternative approaches can be adopted which are based more directly on timing information. It is
also desirable to adopt a consistent approach in making such choices - otherwise morphologically different trees can
result from essentially similar DAGs.

[0028] The process taken in applying this first embodiment of the critical path criterion is as follows. Firstly, for every
leaf node, every possible path towards the root is chased: as the DAG is a directed graph, this is straightforward. As
indicated above, for each leaf node the path with the greatest number of [...]
have the same number of nodes, a selection is made. This is the critical path for that leaf node. All other paths not
selected are cut in their edge closest to the starting point. This cut edge is termed a minor link (equivalent to the term
"cross-link" in the Wolfe reference). The tree consists of the assembly of critical paths, and contains no minor links. The
minor links are stored separately. Minor links will be required when extracted source code is mapped to secondary
processor 2, but are not used in determining which source code is to be mapped to the secondary processor.
[0029] It is of course possible to construct trees from DAGs without using the critical path criterion. Use of the critical
path does provide particular advantages. In particular, removal as minor links of the cross-links not in the critical path
will have little effect on scheduling, whereas if another approach was adopted removed cross-links may have a
considerable influence on timing and hence on scheduling. Use of the critical path criterion allows construction of a tree
which represents as best possible the critical features of the DAG in the context of mapping to hardware.
[0030] Figure 3 shows the application of the process [...]
shows three lines under consideration for execution by secondary processor 2. DAG 12 shows these three lines of code
represented as a directed acyclical graph, with root 126 (variable e) and leaf nodes 121, 129 and 130 as the inputs.
[0031] It is now a straightforward matter to assess each path from a given leaf node to the root, and to compare the
number of nodes in each path. From node 129 (integer value 2), there is only one path, through nodes 122, 123, 124
and 125. This is then the critical path from leaf node 129 to root node 126, and will be present in the tree. From node
121 (in the present case the result of an earlier operation and designated c), there are two paths. The first path passes
through nodes 122, 123, 124 and 125, whereas the second path passes through nodes 127, 128 and 125. The first path
is the critical path, as it passes through more nodes: the second path can thus be cut, as is discussed below. The
remaining leaf node 130 (variable b) also has two paths available: one passes through nodes 123, 124 and 125,
whereas the other passes through nodes 127, 128 and 125. These are equivalent in terms of number of nodes and so
either path can be chosen as the critical path: however, for reasons discussed above (timing and morphological
consistency) it is desirable to operate under an appropriate set of further rules to make the best selection. Such further
rules may, for example, be determined on the basis of the relevant hardware. Here, the second path is chosen.
[0032] The next step to take is to construct a tree 14 from the critical paths chosen from the DAG 12. This is done by
cutting all non-critical paths in their edge closest to the starting point (that is, the edge closest to the starting point
which is not also part of a critical path). The first non-critical path to consider is that from node 121 to root 126 through
nodes 127, 128 and 125. This can be cut on the edge between nodes 121 and 127 - in the tree, this is represented by removal
of edge 151 between nodes 141 (corresponding to 121) and 147 (corresponding to 127), which is stored separately as
a minor link. The other non-critical path to consider is that from node 130 to root 126 through nodes 123, 124 and 125:
this can be cut on the edge between nodes 130 and 123. Again, this cut edge is stored as a minor link.
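
The reduction walked through above can be sketched in Python as follows. This is an illustrative sketch only, not the implementation used in the embodiment: the function names, the edge-list encoding of DAG 12 and the per-node execution times used to break ties are all assumptions introduced here (the weight of 2 on node 127 is hypothetical, chosen so that the tie for leaf 130 resolves as in the text).

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def all_paths(succ: Dict[int, List[int]], node: int, root: int) -> List[List[int]]:
    """Enumerate every path from `node` to `root`, following edges towards the root."""
    if node == root:
        return [[root]]
    return [[node] + rest
            for nxt in succ.get(node, [])
            for rest in all_paths(succ, nxt, root)]


def reduce_dag_to_tree(succ: Dict[int, List[int]], leaves: List[int], root: int,
                       time: Dict[int, int]) -> Tuple[Set[Edge], Set[Edge]]:
    """Return (tree_edges, minor_links) using the node-count critical path criterion.

    Ties on node count are broken by total (hypothetical) execution time.
    """
    critical = {}
    for leaf in leaves:
        paths = all_paths(succ, leaf, root)
        critical[leaf] = max(
            paths, key=lambda p: (len(p), sum(time.get(n, 1) for n in p)))
    critical_edges = {e for p in critical.values() for e in zip(p, p[1:])}

    minor_links: Set[Edge] = set()
    for leaf in leaves:
        for path in all_paths(succ, leaf, root):
            if path == critical[leaf]:
                continue
            # Cut the edge closest to the leaf that is not on any critical path.
            for edge in zip(path, path[1:]):
                if edge not in critical_edges:
                    minor_links.add(edge)
                    break

    all_edges = {(u, v) for u, vs in succ.items() for v in vs}
    return all_edges - minor_links, minor_links


# DAG 12 of Figure 3, encoded from the paths listed in the text (edges point to the root).
dag12 = {129: [122], 121: [122, 127], 130: [123, 127], 122: [123],
         123: [124], 124: [125], 127: [128], 128: [125], 125: [126]}
tree_edges, minor_links = reduce_dag_to_tree(
    dag12, leaves=[121, 129, 130], root=126, time={127: 2})
print(sorted(minor_links))
```

With these assumptions the two stored minor links are the edges 121->127 and 130->123, matching the cuts described for Figure 3.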
[0033] It should be noted that conditionals can be represented in DAGs and so reduced to trees in exactly the same 
way as simple equations. An example is shown in Figure 8: this is a DAG representing the dataflow of the lines. 
[...]
a = b

[...]
a = y



and shows a multiplexer node 186 and a 'less than' operation node 186 in addition to the variable and integer nodes
181, 182, 183 and 184. As the skilled man will appreciate, it will generally be possible to use the approach shown here
for source code which can be represented as a DAG.

[0034] The tree structure that is left - in this case, tree 14 - is [...]
source code should be mapped to secondary processor 2, as is discussed further below. The technique described
above is a particularly appropriate one for converting DAGs to trees, as it is straightforward to implement [...]
application, and through use of the critical path maintains the maximum [...]
synthesised (assuming each node represents a single computational element) because of the inclusion of paths with the
maximum number of nodes. As the person skilled in the art will appreciate, alternative approaches [...]
edges are to be removed in converting the DAGs into trees can be adopted. One alternative embodiment of the DAG to
tree reduction process is to assign a timing-based weight to every node (based, for example, on the length of time
required to execute the corresponding computational element) and then to compare the accumulated weights of each
path, selecting a path to define the tree accordingly on the basis of, for example, greatest accumulated weight. This
approach may be more appropriate if the [...]
and in particular if the timing dependencies are not mainly related to the node count (which may be the case in
structures where, for example, multiplication is several times more time consuming than addition).
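
As a hedged illustration of the timing-weighted alternative, the sketch below selects the path with the greatest accumulated weight rather than the greatest node count. The node names and the weights (a multiplication taken as four times as costly as an addition) are hypothetical values chosen for the example, not figures from the embodiment.

```python
from typing import Dict, List


def heaviest_path(paths: List[List[str]], weight: Dict[str, float]) -> List[str]:
    """Pick the path whose nodes have the greatest accumulated timing weight."""
    return max(paths, key=lambda path: sum(weight[n] for n in path))


# Hypothetical weights: inputs and outputs cost nothing, add = 1, mul = 4.
weight = {"b": 0, "add1": 1, "add2": 1, "add3": 1, "mul": 4, "out": 0}
paths = [
    ["b", "add1", "add2", "add3", "out"],  # more nodes, accumulated weight 3
    ["b", "mul", "out"],                   # fewer nodes, accumulated weight 4
]
selected = heaviest_path(paths, weight)
print(selected)
```

Under these assumed weights the shorter path through the multiplication is selected, whereas the node-count criterion would have chosen the longer path of additions.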
[0035] The next step in the compilation [...]
the selection of source code for the secondary processor 2. As is further illustrated in Figure 2, this step of the process
[...] candidate dataflows. This is a significant original step, and is discussed in detail below.

[0036] The objective in this stage of the compilation process is to determine as best possible which [...]
dataflows from the source code would be the best choices for execution by the secondary processor. This is to a large
[...]
code to the secondary processor 2 can be made where dataflows are sufficiently similar that broadly the same
hardware representation can be used for each dataflow. It therefore follows that good choices of candidate dataflows for
mapping to the secondary processor can be made by finding sets of dataflows that are sufficiently similar to each other.
This is what is achieved by analysing and classifying the trees resulting from the candidate dataflows.
[0037] A technique for matching trees, used in this embodiment of the invention, is the tree matching algorithm
devised by Kaizhong Zhang of the University of West Ontario, Canada.

[0038] This algorithm is described in Kaizhong Zhang, "A Constrained Edit Distance Between Unordered Labelled
Trees", Algorithmica (1996) 15:205-222, Springer Verlag, and is provided as a toolkit by the University of West Ontario,
the toolkit being at the time of writing obtainable over the internet from ftp^/tto.c^.iJwacaA>ubA2hangm^Etool.tar gz
[...] skilled man. The approach to tree matching [...]

[0039] The operation of Zhang's algorithm is the following: two trees are compared node-by-node through
a dynamic programming technique that minimises the edit operations required to transform one tree into another. The
cost of a transformation is termed here an edit cost. The edit costs of successively larger subtrees are cross-compared
[...] minimum costs found. The computational structure can be characterised as that of a
[...] a working dynamic programming grid to calculate component subtree distances
and records the result on the main grid.

[0040] The edit operations available are insertion, deletion and substitution. These are shown in Figures 4a and 4b.
In Figure 4a, a new node is inserted between nodes 3 and 5 of tree 151: this new node gives the structure of tree 152.
Consequently transformation of tree 151 to tree 152 is achieved by insertion of this node, and transformation of tree 152
to tree 151 is achieved by deletion of [...] (in the CHESS architecture described in Appendix A, "deletion" is represented
in hardware by "bypass" of a unit of the array: this is an example of an architecturally designed cost - in this case a
[...]). For Figure 4b, the two trees 151 and 152 have the same structure, but the two nodes 4 represent
a different type of operation in each tree: it is therefore necessary to substitute for node 4 in transforming one tree to
the other. Every node therefore needs a "label": a tag attached to the node which identifies the type of node among the
various types of node possible.

[0041] As previously indicated, each of these [...]
for example, the same result may be achieved in some architectures either by an insertion and a deletion, or by a
substitution: the costs of these different alternatives can be compared.

[0042] The result of the comparison of two trees by this algorithm is the production of a list of pairs of nodes (t1, t2)
where t1 belongs to the first tree and t2 belongs to the second tree. Each pairing constitutes an identification of similar
points in the two trees, suggesting the mapping of t1 and t2 on to each other. The list of pairs effectively defines the
skeleton of a tree which can contain either of the compared trees: in this skeleton, to transform the first tree into the
second tree, each node t1 has to be substituted with the respective t2. Nodes that do not occur in the mapping must be
either inserted or deleted depending on which tree they belong to, as is discussed further below. For this list of pairs
there will be defined an edit distance: this is the minimum in edit costs cumulated over the pairs necessary to transform
one tree to the other. The algorithm is devised to determine an edit distance between two trees, together with the set of
transformations which achieves that edit distance: alternative transformations will be possible, but they will have a
higher associated cumulative edit cost.
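
To make the relationship between a list of node pairs and its cumulative edit cost concrete, the following sketch totals the cost of a given mapping. It does not implement Zhang's algorithm (which searches for the minimising mapping); the function name, the node identifiers, the labels and the unit costs are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def mapping_cost(labels1: Dict[int, str], labels2: Dict[int, str],
                 pairs: List[Tuple[int, int]],
                 sub_cost: Callable[[str, str], int],
                 ins_cost: Callable[[str], int],
                 del_cost: Callable[[str], int]) -> int:
    """Cumulative edit cost of transforming tree 1 into tree 2 under a given mapping."""
    mapped1 = {t1 for t1, _ in pairs}
    mapped2 = {t2 for _, t2 in pairs}
    # Each paired node t1 is substituted with its partner t2.
    cost = sum(sub_cost(labels1[t1], labels2[t2]) for t1, t2 in pairs)
    # Unmapped nodes of the first tree are deleted.
    cost += sum(del_cost(labels1[n]) for n in labels1 if n not in mapped1)
    # Unmapped nodes of the second tree are inserted.
    cost += sum(ins_cost(labels2[n]) for n in labels2 if n not in mapped2)
    return cost


# Hypothetical node labels for two small trees and a mapping between them.
tree1 = {1: "add", 2: "mul", 3: "add"}
tree2 = {1: "add", 2: "mul", 3: "sub", 4: "add"}
pairs = [(1, 1), (2, 2), (3, 3)]

total = mapping_cost(
    tree1, tree2, pairs,
    sub_cost=lambda a, b: 0 if a == b else 1,  # unit substitution cost
    ins_cost=lambda label: 1,                  # unit insertion cost
    del_cost=lambda label: 1,                  # unit deletion cost
)
print(total)
```

Here one substitution (add to sub) and one insertion (node 4 of the second tree) give a cumulative cost of 2 under the assumed unit costs; the edit distance is the minimum of this quantity over all admissible mappings.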

[0043] The value of computing an edit distance based on edit costs is that the edit costs may be chosen to represent
the "hardware cost" in reconfiguring the secondary processor from the configuration representing one tree to a
configuration representing the other tree in a mapping. This "hardware cost" is typically a measure of the quantity of
secondary processor resources that will be taken up to achieve the second configuration given the existence of the first -
this can be considered, for example, in terms of the additional area of device used. These costs will be determined by the
nature of the secondary processor hardware, as for different types of hardware the physical realisation of insertion,
deletion and substitution operations will be different. For the reconfigurable CHESS array discussed in Appendix A, a
"bypass" operation involves minimal cost, a substitution between adds and subs (addition and subtraction operations)
has low cost, whereas substitution between muls and divs (multiplication and division operations) is expensive.
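
An architecture-specific cost table of this kind might be sketched as below. Only the ordering (bypass cheapest, add/sub cheap, mul/div expensive) comes from the text; every numeric value, the default cost, and the treatment of insertion and deletion as a bypass are assumptions chosen for illustration.

```python
# Hypothetical relative costs; the description gives only their ordering.
BYPASS_COST = 1                 # assumed: insert/delete realised as a bypass
DEFAULT_SUBSTITUTION_COST = 10  # assumed cost for pairs not listed below
SUBSTITUTION_COSTS = {
    frozenset({"add", "sub"}): 2,    # low cost
    frozenset({"mul", "div"}): 50,   # expensive
}


def substitution_cost(op_a, op_b):
    """Cost of reconfiguring a node from op_a to op_b (symmetric)."""
    if op_a == op_b:
        return 0
    return SUBSTITUTION_COSTS.get(frozenset({op_a, op_b}), DEFAULT_SUBSTITUTION_COST)


def insertion_cost(op):
    """Cost of adding an operation that the other tree lacks."""
    return BYPASS_COST


def deletion_cost(op):
    """Cost of skipping an operation that the other tree lacks."""
    return BYPASS_COST
```

A different secondary processor would supply a different table, which is what ties the resulting edit distances to the hardware.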
[0044] As indicated above, an edit distance between two trees can be constructed. However, a further step can be
taken: using Zhang's algorithm, or a comparable approach, a taxonomy can be built to show the edit distances between
each one of a set of trees. This taxonomy can readily be provided in the form of a tree, of which an example is shown
in Figure 5. Each leaf node 161 of the tree represents a candidate tree extracted from a DAG, and each intermediate
node 162 represents an edit cost. The tree provides a unique path between each pair of leaf nodes. The edit distance
between the two leaf nodes of a pair is found by summation of the costs provided at each intermediate node on this path.
For example, the edit distance between any pair of the leaf nodes representing Tree#4, Tree#5 or Tree#6 is 6. However, the
edit distance between Tree#1 and Tree#4 is 496: the summation of [...] and 6.
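
The path summation over the taxonomy can be sketched as follows. The parent-pointer representation and the name `taxonomy_distance` are assumptions, and the intermediate cost of 490 is a hypothetical value chosen only so that the two totals quoted in the text (6 and 496) are reproduced; the actual intermediate costs of Figure 5 are not legible here.

```python
def taxonomy_distance(parent, cost, leaf_a, leaf_b):
    """Sum the costs of the intermediate nodes on the path between two leaves."""
    if leaf_a == leaf_b:
        return 0

    def ancestors(leaf):
        # Intermediate nodes from the leaf's parent up to the taxonomy root.
        chain = []
        node = parent.get(leaf)
        while node is not None:
            chain.append(node)
            node = parent.get(node)
        return chain

    chain_a = ancestors(leaf_a)
    chain_b = ancestors(leaf_b)
    # The first shared ancestor is where the two upward paths meet.
    common = next(n for n in chain_a if n in chain_b)
    path = chain_a[:chain_a.index(common) + 1] + chain_b[:chain_b.index(common)]
    return sum(cost[n] for n in path)


# Hypothetical taxonomy: one intermediate node of cost 6 above Tree#4..#6,
# and a root of (assumed) cost 490 joining it to Tree#1.
parent = {"Tree#1": "root", "n6": "root",
          "Tree#4": "n6", "Tree#5": "n6", "Tree#6": "n6"}
cost = {"root": 490, "n6": 6}
```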

[0045] This taxonomy is indicative of the number of edit operations required to translate between trees. Such a
taxonomy is a valuable tool, as it can be used heuristically as a metric for the degree of variation between candidate trees.
The creation of a taxonomy thus renders it easy to determine which trees are sufficiently similar to be consolidated
together (as will be discussed below), and which are too diverse for this purpose. This can be done by imposition of an
edit distance threshold. A group of trees can be selected for consolidation if the edit distance between each and every
possible pair of trees in the group is less than the edit distance threshold. The value of the edit distance threshold is
arbitrary, and can be chosen by the person skilled in the art in the context of specific primary and secondary processors
in order to optimise the performance of the system.
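
The pairwise threshold test can be sketched as below. The check `can_consolidate` follows the stated criterion directly; the greedy grouping in `greedy_classes` is an assumption added for illustration, since the text does not prescribe how groups are enumerated. The distance table reuses the example values 6 and 496.

```python
from itertools import combinations


def can_consolidate(group, distance, threshold):
    """True if every pair of trees in the group is closer than the threshold."""
    return all(distance(a, b) < threshold for a, b in combinations(group, 2))


def greedy_classes(trees, distance, threshold):
    """Assumed strategy: place each tree in the first class it fits entirely."""
    classes = []
    for tree in trees:
        for cls in classes:
            if all(distance(tree, member) < threshold for member in cls):
                cls.append(tree)
                break
        else:
            classes.append([tree])  # no existing class fits: start a new one
    return classes


# Hypothetical distances: Tree#4..#6 are mutually 6 apart, everything else 496.
_close = {frozenset(p): 6 for p in combinations(["T4", "T5", "T6"], 2)}


def distance(a, b):
    return _close.get(frozenset((a, b)), 496)


classes = greedy_classes(["T1", "T4", "T5", "T6"], distance, threshold=10)
```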

[0046] The advantage of consolidating a group of trees is that a common hardware configuration can be used for the
whole group and will support the function of each tree. This is particularly appropriate for architectures, such as
CHESS, in which low-latency partial reconfiguration mechanisms are available on the secondary processor.
Reconfiguration is required to change the configuration from that to support the function of one tree to that to support the
function of another tree: however, as the edit distance between these trees will never be greater than the edit cost
threshold, the degree of reconfiguration required is already known to be within acceptable bounds. The group of trees are
consolidated together by construction of a "supertree" which contains a representation of every component tree. After it
has been constructed, the supertree can be converted into a representation of each of the relevant DAGs extracted from
the source code by reinsertion of the previously removed minor links. The hardware configuration may then be determined
from the full supertree. The construction of the supertree is discussed in detail below.

[0047] Figure 6 illustrates the construction of a supertree from a group of trees which fall below the specified
edit cost threshold: such a group of trees is here termed a class. The trees 171, 172 and 173 can all be mapped
together into supertree 170. The reconfiguration required to change the hardware configuration from that to support, for
example, tree 171 to that of tree 172 is sufficiently limited to be realizable in practice, because the edit distance between
the two trees is below the edit cost threshold.

[0048] An exemplary supertree assembly algorithm, merge, is provided as C code in Appendix B. The function of the
algorithm is described below, with reference to Figure 9. The algorithm contains the following elements:
merge:

[0049] The tree in the class with the largest number of nodes is chosen to be the initial merge tree - if there are trees
with an equal number of nodes, an arbitrary selection can be made. The remaining trees are termed source trees.
[0050] For each source tree the following operations are then applied:

From the mapping between the source tree and the merge tree which has been calculated (in this embodiment,
[...] and the edit cost is determined from the secondary processor architecture), the supertree is constructed as follows:

1. Firstly, mapped nodes closest to the root are considered;

2. The source tree operation (source operation) is concatenated to the corresponding mapped merge tree
operation (merge operation);

3. For each child operation of the source operation:

a. If the child is mapped, revert to step 2 with respect to the source child.

b. If there is a further mapping inside the source subtree, connect the subtree as follows:

i. If the merge operation of this subordinate mapping does not fall within the previously mapped
subtree, remove the mapped source operation from the source tree. There is recursion present at this
stage - where mapped children have already been dealt with, all that needs to be done is to
remove what would otherwise be a cross tree link.

ii. This is shown in Figure 9. If the merge operation of this subordinate mapping does fall within
the previously mapped subtree, climb up the merge tree until the least common ancestor for all
combined subordinate mappings is found. The least common ancestor is the lowest merge tree node
whose subtree contains all of the source mappings. The unmapped source segment is then mapped
into the merge tree by linking the source operation of the unmapped source subtree as a child of the
least common ancestor's parent and by linking the least common ancestor as the child of the
unmapped source operation just above the closest mapped source operation in the current subtree
(where the "closest mapped source operation" is defined [...] and is a mapped node which falls
within the subtree of the current mapping - the source node's parent, which is unmapped, adopts
the merge tree's least common ancestor as a child and vice versa).

The pair of intermingled trees [...]

The procedure continues until all the source trees in the class are contained in the supertree.
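
Three of the building blocks named above (choosing the initial merge tree, climbing to a least common ancestor, and linking an unmapped source operation between that ancestor and its parent) can be sketched as follows. This is not the Appendix B code and does not implement the full merge: the parent-pointer dictionaries and all function names are assumptions, and the small example tree follows one reading of the Figure 9 discussion (A as root, X beneath it, D and E beneath X).

```python
def pick_merge_tree(trees):
    """Return (merge_tree, source_trees); each tree is a dict node -> parent."""
    merge = max(trees, key=len)  # ties: the first largest tree is taken
    sources = [t for t in trees if t is not merge]
    return merge, sources


def least_common_ancestor(parent, nodes):
    """Climb from the first node until an ancestor (or the node itself) covers all."""
    def chain(node):
        out = []
        while node is not None:
            out.append(node)
            node = parent[node]
        return out

    first, *rest = [chain(n) for n in nodes]
    others = [set(c) for c in rest]
    return next(n for n in first if all(n in s for s in others))


def splice_above(parent, new_node, anchor):
    """Link new_node as child of anchor's parent and as parent of anchor."""
    parent[new_node] = parent[anchor]
    parent[anchor] = new_node


# Assumed merge tree: root A, child X, and mapped operations D and E below X.
merge_parent = {"A": None, "X": "A", "D": "X", "E": "X"}
lca = least_common_ancestor(merge_parent, ["D", "E"])
splice_above(merge_parent, "C", lca)  # unmapped source operation C
```

After the splice, C sits between A and X, so the unmapped source operation and the merge tree's least common ancestor have adopted each other as described in step 3(b).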



[0051] This process is indicated in Figure 9. Figure 9a shows two dataflow trees, a merge tree 201 and a source tree
[...] mappings between nodes made by the comparison algorithm - the remaining nodes need to be inserted
appropriately. As indicated in section 1 above, the first step is to consider the mapped operations nearest
the root - in this case, at the root. These operations A are concatenated [...]

[0052] After this, the child nodes of A in the source tree are considered. Node B does not have a mapping and is not
an ancestor to any mappings - it is therefore merged as a child of A* (see Figure 9b). The other child [...]
operations fall in the previously mapped subtree (as they are both descendants of A). It is therefore necessary to
[...] (see section 3 above) [...] the least common ancestor containing both mapped operations D and E [...] X.
C of the source tree is thus linked into the merge tree as child of A (the parent of X) and parent of [...]
a supertree node - merging is thus entirely straightforward, and consists only of concatenation. The
process continues until all the candidate trees are merged into a supertree.
[0054] At this stage, it is possible [...] remaining DAGs with the supertree by a backmapping process. Processes
derived from [...] to use of Zhang's algorithm, and match further [...] tree, or where the edit cost to such a
mapping [...]. Information related to any such dataflows added by this backmapping process needs to be stored also.

[0055] [...] conversion into trees (including here any DAGs added from the backmapping process, if employed).
[...] class dataflow [...] to determine any reconfiguration [...] for the purpose of determining the hardware
configuration of the secondary processor [...] secondary processor: these steps are described further below.

[0056] Stitching calls to the secondary processor back into the source code [...] leaves and of the output of the
dataflow (root of the relevant tree) with a read [...] and code substituting the dataflow [...] labelled Input
Tree #3, is shown, together with a supertree, labelled PFU Tree. Each node in Input Tree #3 has its own [...]
other I/O resources are allocated to the leaves and the root. The implicit mapping between [...]

[0057] [...] the class dataflow, it is possible to configure the secondary processor. This [...] operations,
and including in appropriate form any dynamic reconfiguration instructions) and then [...] secondary processor
hardware [...]

[0058] [...] Reconfigurable Computer [...] Casselman, currently available on the World Wide Web [...]
a contribution to the [...] Reconfigurable Computing [...] operated by SB Associates, Inc. of 504 Nino Avenue,
Los [...] of the primary and the secondary processor [...] using tools appropriate to the processor concerned.
[0059] Once the source code is generated in executable form with appropriate calls to the secondary processor, and
when the secondary processor configuration has been determined, [...] source code is executed in the primary
processor with calls to coprocessors [...] significantly increased. For example, a 25% improvement was found in
application of the method [...]

APPENDIX A
CHESS array

The CHESS array is a variety of field programmable array in which the programmable
elements are not gates, as in an FPGA, but 4-bit arithmetic logic units (ALUs). The array
configuration is described in detail in European Patent Application No. 97300563.0, and the
ALU structure and provision of instructions to ALUs are discussed in a copending application
entitled "Reconfigurable Processor Devices" and filed on the same date as the present
application.

The CHESS array consists of a chessboard layout with alternating squares comprising an ALU
and a switchbox structure respectively. The configuration memory for an adjacent switchbox
is held in the ALU. Individual ALUs may be used in a processing pipeline, and in a preferred
implementation, provision is made to allow dynamic provision of instructions from one ALU
to determine the function of a succeeding ALU. ALUs are 4-bit, with four identical bitslices,
with 4-bit inputs A and B taken directly from an extensive 4-bit interconnect wiring network,
and 4-bit output U provided to the wiring network through an optionally latchable output
register: 1-bit carry input and output are also provided and have their own interconnect.

Dynamic instructions are providable from the output U of one ALU to a 4-bit instruction input
I of another ALU. The carry output Cout of one ALU can also be used as Cin of another ALU
with the effect of changing the instruction of that ALU.

The CHESS ALU is adapted to support multiplexing between A and B inputs, and also
supports multiplexing between related instructions (eg OR/NOR, AND/NAND).
Reconfiguration between such instructions can be achieved through appropriate use of the
carry inputs and outputs without consumption of silicon. More complex reconfigurations (eg
AND/XOR, Add/Sub) can be achieved through using two ALUs, the first to multiplex between
the two alternative instructions and the second to execute the chosen instruction on the
operands. Multiplication will take up more than a single ALU, making reconfiguration
involving a multiplication operation more complex. It is straightforward using the multiplexer
capacity of a CHESS ALU to "bypass" an operation, with appropriate control resulting in
either performance of the operation or propagation of a given input.

A sample set of functions obtainable from the instruction inputs is indicated in Table A1
below: a wide range of possibilities are available with appropriate logic in connection of the
instruction inputs to the ALU. The functions are described in Table A2.

Table A1: Instruction bits and corresponding functions

Name          | U function             | Cout function
ADD           | A plus B               | Arithmetic carry
SUB           | A minus B              | Arithmetic carry
A AND B       | Ui = Ai AND Bi         | Cout = Cin
A OR B        | Ui = Ai OR Bi          | Cout = Cin
A NOR B       | Ui = NOT (Ai OR Bi)    | Cout = Cin
A XOR B       | Ui = Ai XOR Bi         | [...]
A NXOR B      | Ui = NOT (Ai XOR Bi)   | [...]
A AND NOT B   | Ui = Ai AND (NOT Bi)   | Cout = Cin
B AND NOT A   | Ui = (NOT Ai) AND Bi   | [...]
NOT A OR B    | Ui = (NOT Ai) OR Bi    | Cout = Cin
B OR [...]    | [...]                  | Cout = Cin
A             | [...]                  | Cout = Cin
B             | Ui = Bi                | [...]
[...]         | Ui = NOT [...]         | Cout = Cin
[...]         | Not applicable         | if A == B then 0, else 1
MATCH1        | Not applicable         | bitwise AND of A and B, followed by OR across width of the word
MATCH0        | Not applicable         | bitwise OR of A and B, followed by an AND across the width of the word

Table A2: Outputs for instructions
The MATCH functions are so-called because for MATCH1 the value of 1 is [...]
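
A few of the tabulated functions can be modelled in software as a sketch of how the U and Cout outputs relate to the 4-bit inputs. The function name `alu`, the spelling of the instruction names, the use of `None` for "not applicable", and the inclusion of the carry input in ADD are assumptions of this example; it is a behavioural illustration, not a description of the CHESS hardware.

```python
MASK = 0xF  # the ALUs described are 4 bits wide


def alu(name, a, b, carry_in=0):
    """Return (u, carry_out) for a subset of the tabulated functions.

    u is None where the table gives "Not applicable" for the U output.
    """
    a &= MASK
    b &= MASK
    if name == "ADD":
        total = a + b + carry_in          # assumed: carry input is added in
        return total & MASK, total >> 4   # arithmetic carry out of bit 3
    if name == "A AND B":
        return a & b, carry_in            # Cout = Cin
    if name == "A OR B":
        return a | b, carry_in            # Cout = Cin
    if name == "A NOR B":
        return ~(a | b) & MASK, carry_in  # Cout = Cin
    if name == "MATCH1":
        # Bitwise AND, then OR across the word: 1 if any bit is 1 in both.
        return None, 1 if (a & b) else 0
    if name == "MATCH0":
        # Bitwise OR, then AND across the word: 1 if every bit is 1 in A or B.
        return None, 1 if (a | b) == MASK else 0
    raise ValueError("function not modelled in this sketch: " + name)
```

For instance, `alu("MATCH1", 0b1000, 0b1000)` yields a carry output of 1 because both inputs have a 1 in the same bit position.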
Claims.

1. A method of compiling source code to a primary and a secondary processor, comprising:

selective extraction of dataflows from the source code;
transformation of the extracted dataflows into trees;
matching of the trees against each other to determine minimum edit cost relationships for transformation of one
tree into another;
determining a group or a plurality of groups of dataflows on the basis of said minimum edit cost relationships
and creating for each group a generic dataflow capable of supporting each dataflow in that group;
using the generic dataflow or dataflows to determine the hardware configuration of the secondary processor;
and
substituting into the source code calls to the secondary processor for said group or plurality of groups of
dataflows, and compiling the resultant source code to the primary processor.

2. A method as claimed in claim 1, wherein [...] minimum edit distances for classification of the trees.

3. A method as claimed in claim 1 or claim 2, wherein said minimum edit cost relationships are determined according
to the architecture of the secondary processor, and represent a hardware cost of a corresponding reconfiguration
of the secondary processor.

4. A method as claimed in any of claims 1 to 3, wherein the hardware configuration [...]
for reconfiguration of the secondary processor during execution [...]

5. A method as claimed in claim 4, wherein the secondary processor is an application specific instruction processor.

6. A method as claimed in claim 4, wherein the secondary processor is a field programmable gate array.

7. A method as claimed in claim 4, wherein the secondary processor is a field programmable arithmetic array.

8. A method as claimed in any of claims 4 to 7, wherein reconfiguration of the secondary processor is required during
execution of the source code to support each dataflow in the group [...]

9. A method as claimed in any preceding claim, wherein a generic dataflow of a group is calculated by an approximate
mapping of dataflows in the group on to each other, followed by a merge operation.

10. A method as claimed in any preceding claim, wherein the dataflows are provided as directed acyclical graphs and
are reduced to trees by removal of any links in the directed acyclical graphs not present in a critical path between
a leaf node and the root of a directed acyclical graph.

11. A method as claimed in claim 10, wherein the critical path is a path between two nodes which passes through the
largest number of intermediate nodes.

12. A method as claimed in claim 10, wherein the critical path is a path between two nodes with the greatest
accumulated execution time.

13. A method as claimed in any of claims 10 to 12, wherein after the creation of a generic dataflow, the generic dataflow
is compared with further dataflows extracted from the source code and provided in the manner defined in claim 10,
wherein those of said further dataflows which match sufficiently closely the generic dataflow are added to the
generic dataflow.

14. A method as claimed in any of claims 10 to 13 where dependent on claim 9, wherein the removed links are
stored after the directed acyclical graphs are reduced to trees and are reinserted into the generic dataflow after the
merging of the trees of the group into the generic dataflow.

Figure 9a: Two Dataflow Trees

Figure 9d: Next Candidate Mapping